Method for determining biological expression levels by linear programming

ABSTRACT

A method for determining a matrix of expression levels corresponding to a set of biological targets (e.g., genes or gene fragments) and a set of biological samples, including obtaining a matrix of signal values corresponding to the set of biological targets; computing a vector of expression levels for a sample in the set of biological samples using the matrix of signal values; storing the vector of computed expression levels in a storage matrix; repeating the computing and storing steps for each sample in the set of biological samples; and outputting the storage matrix as the matrix of expression levels. The method, based on a linear programming formulation of the problem, works for both “promiscuous” probe array data, in which there may be multiple targets indicated by a single probe, and the “polygamous” case, in which there are multiple probes for a single target. The preferred method can also process data obtained from multiple SAGE analyses using multiple markers. A second embodiment of the method determines optimal expression levels when the available probe data is noisy or uncertain.

BACKGROUND OF THE INVENTION FIELD OF THE INVENTION

[0001] The present invention relates generally to systems and methodsfor the determination of biological expression levels present in abiological sample using data obtained from biological expression arrays.

[0002] The present invention includes the use of various technologiesdescribed in the references identified in the following LIST OFREFERENCES by the author(s) and year of publication and cross-referencedthroughout the specification by reference to the respective number, inparentheses, of the reference:

LIST OF REFERENCES

[0003] [1] BERTSEKAS, D. P. Nonlinear Programming, 2^(nd) Edition.Athena Scientific, 1999.

[0004] [2] BRAS, M., CLOAREC, J., BESSUEILLE, F., SOUTEYRAND, E.,MARTIN, J., AND CHAUVET, J. Control of immobilization and hybridizationon DNA chips by fluorescence spectroscopy. Journal of Fluorescence 10(2000), 247-253.

[0005] [3] CANTOR, C. R., AND SMITH, C. L. Genomics. Wiley, 1999.

[0006] [4] CLARK, T., LEE, S., SCOTT, L. R., AND WANG, S. M.Computational analysis of gene identification with SAGE. J. Comp. Bio.,accepted.

[0007] [5] HERNAIZ, M., LIU, J., ROSENBERG, R., AND LINHARDT, R.Enzymatic modification of heparan sulfate on a biochip promotes itsinteraction with antithrombin iii. Biochemical and Biophysical ResearchCommunications 276 (2000), 292-297.

[0008] [6] KANE, M. D., JATKOE, T. A., STUMPF, C. R., LU, J., THOMAS, J.D., AND MADORE, S. J. Assessment of the sensitivity and specificity ofoligonucleotide (50 mer) arrays. Nuclei Acids Research 28, 22 (2000),4552-4557.

[0009] [7] LI, C., AND WONG, W. H. Model-based analysis ofoligonecleotide arrays: expression index computation and outlierdetection. Proc. Natl. Acad. Sci. U.S.A. 98 (2001), 31-36.

[0010] [8] LI, C., AND WONG, W. H. Model-based analysis ofoligonecleotide arrays: model validation, design issues and standarderror application. Genome Biology 2 (2001), 32.1-32.11

[0011] [9] LIPSHUTZ, R. J., FODOR, S. P. A., GINGERAS, T. R., ANDLOCKHART, D. J. High density synthetic oligonucleotide arrays. NatureGenetics 21 (1999), 20-24.

[0012] [10] LUEKING, A., HORN, M., EICKHOFF, H., BUSSOW, K., LEHRACH,H., AND WALTER, G. Protein microarrays for gene expression and antibodyscreening. Analytical Biochemistry 270 (1999), 103-111.

[0013] [11] MACBEATH, G., KOEHLER, A. N., AND SCHREIBER, S. L. Printingsmall molecules as microarrays and detecting protein-ligand interactionsen masse. J. Am. Chem. Soc. 121 (1999), 7967-7968.

[0014] [12] MACBEATH, G., AND SCHREIBER, S. L., Printing proteins asmicroarrays for high-throughput function determination. Science 289,5485 (2000), 1760-1763.

[0015] [13] Walter, et al., S. L. Protein arrays for gene expression andmolecular interaction screening. Curr. Opin. Microbiol. 3 (2000),298-302.

[0016] [14] MARK SCHENA, E. Microarray biochip technology. EatonPublishing, 2000.

[0017] [15] NOCEDAL, J., AND WRIGHT, S. J. Numerical optimization.Springer, 1999.

[0018] [16] PEYRET, N., SENEVIRATNE, P. A., ALLAWI, H. T., ANDSANTALUCIA, JR., J. Nearest-neighbor thermodynamics and NMR of DNAsequences with A-A, C-C, G-G, and T-T mismatches. Biochemistry 38(1999), 3468-3477.

[0019] [17] PILARSKY, C. P., SCHMITT, A. O., DAHL, E., AND ROSENTHAL, A.Microarrays—chances and challenges. Curr. Opin. Molecular Therapeutics 1(1999), 727-737.

[0020] [18] SCHENA, M., SHALON, D., DAVIS, R. W., AND BROWN, P. O.Quantitative monitoring of gene expression patterns with a complementaryDNA microarray. Science 270 (1995), 467-470.

[0021] [19] UETZ, P., GIOT, L., CAGNEY, G., MANSFIELD, T. A., JUDSON, R.S., KNIGHT, J. R., LOCKSHON, D., NARAYAN, V., SRINIVASAN, M., POCHART,P., QURESHI-EMILI, A., LI, Y., GODWIN, B., CONOVER, D., KALBFLEISCH, T.,VIJAYADAMODAR, G., YANG, M., JOHNSTON, M., FIELDS, S., AND ROTHBERG, J.M. A comprehensive analysis of protein-protein interactions inSaccharomyces cerevisiae. Nature 403 (2000), 623-627.

[0022] [20] VO-DINH, T., AND CULLUM, B. Biosensors and biochips:advances in biological and medical diagnostics. Fresenius Journal ofAnalytical Chemistry 366 (2000), 540-551.

[0023] [21] WARRINGTON, J. A., DEE, S., AND TRULSON, M. Large-scalegenomic analysis Affymetrix GeneChip (R) probe arrays. In Microarraybiochip technology. Eaton Publishing, 2000, pp. 119-148.

[0024] [22] WRIGHT, G., CAZARES, L., LEUNG, S., NASIM, S., ADAM, B.,YIP, T., SCHELLHAMMER, P. F., GONG, L., AND VLAHOU, A. Proteinchip(R)surface enhanced laser desorption/ionization (SELDI) mass spectrometry:a novel protein biochip technology for detection of prostate cancerbiomarkers in complex protein mixtures screening. Prostate Cancer andProstatic Diseases 2 (1999), 264-276.

[0025] [23] YERSHOV, G., BARSKY, V., BELGOVSKIY, A., KIRILLOV, E.,KREINDLIN, E., IVANOV, I., PARINOV, S., GUSCHIN, D., DROBISHEV, A.,DUBILEY, S., AND MIRZABEKOV, A. DNA analysis and diagnostics onoligonucleotide microchips. Proc. Natl. Acad.Sci. U.S.A. 93 (1996),4913-4918.

[0026] [24] LEE, S., CLARK, T., CHEN, J., ZHOU, G., SCOTT, L. R.,ROWLEY, J. D., AND WANG S. M. Correct Identification of Genes fromSerial Analysis of Gene Expression Tag Sequences. GENOMICS Vol. 79.Number 4, (April 2002), 598-602.

[0027] [25] SAGEMAP: SERIAL ANALYSIS OF GENE EXPRESSION TAG TO GENEMAPPING. National Center for Biotechnology Information (NCBI).www.ncbi.nlm.nih.gov/SAGE/.

[0028] [26] VAN DAM, R. M., AND QUAKE, S. R. Gene Expression Analysiswith Universal n-mer Arrays, Cold Spring Harbor Laboratory Press ISSN.(2002), 145-152.

[0029] [27] ZHANG, L., ZHOU, W., VELCULESCU, V. E., KERN, S. E., HRUBAN,R. H., HAMILTON, S. R., VOGELSTEIN, B., AND KINZLER, K. W. Geneexpression profiles in normal and cancer cells, Science 276 (1997),1268-1272.

[0030] The entire contents of each reference listed in the LIST OFREFERENCES, are incorporated herein by reference.

DISCUSSION OF THE BACKGROUND

[0031] Biological expression arrays are introducing a type of massivelyparallel processing in experimental biology and medicine [17] [20]. Geneexpression arrays based on cDNA's [18] and on oligonucleotides [21] havereached a level in which the technology is well established, with booksdevoted to the subject [14]. Current research includes efforts to refinethe technology [2].

[0032] More recently, protein expression arrays have been developed[12][13][22][19], and utilized to identify antibodies [10]. Otherexamples of biological expression arrays include small moleculeexpression arrays [11] as well as other specialized arrays [5].

[0033] Biological expression arrays are arrays of small biochemicalexperiments, each of which can be different from the other. Each “dot”on the array contains a reactant, called a probe. The set of all probescan be tested against a sample, which presumably contains a set ofso-called targets which are to be determined, together with theirquantity, or “expression level.” For gene expression arrays, the targetsare genes or gene fragments (called expressed sequence tags, or ESTs).The “expression levels” could as well be ratios of other more basicquantities, such as the ratio of the observed value for a perfect matchand a mismatch [7][8].

[0034] The process of matching probes and targets for gene expressionarrays is called hybridization. If the hybridization is perfectcomplementarity, the product may rather be called a perfect complementrather than a hybrid. But when there are some complementaritymismatches, as is the general case, it is appropriate to think of theprobe and target as coming from different genes, so the word hybrid ismore appropriate. There is a difference between levels or detectionsignals of hybridization values (the input data) and expression values(the desired answer). In some cases, these are in a one-to-onerelationship, so both are often called “expression” values, but ingeneral they are not related in a simple way.

[0035] Arrays based on oligonucleotide probes of various lengths havebeen widely (often) used. A string of n bases is referred to as ann-mer. In the literature, one finds data from using 8-mers [23], 25-mers[21], 50-mers [6], and even longer oligonucleotides probes are currentlyin use as discussed on web sites. DNA-arrays based on probes consistingof significantly longer sequences (a few hundred bases) when synthesizedvia template-dependent enzymatic reactions are often called cDNA arrays[18]. A positive hybridization signal from the array is assumed to meana complementary match (or near match) of sequences between the targetsand the probes.

[0036] One successful approach to gene expression determination is thatadopted by Affymetrix, Inc. (Santa Clara, Calif.) [21]. In theirtechnique, oligonucleotides from several parts (loci) of a known geneare used to define array probes. This redundancy insures a high degreeof certainty in identifying the expression of that gene, allowingprecise discriminative detection between perfect matching andmismatching of probes to hybrids. (One presumes that care is also takento be sure that the resulting oligonucleotides are unique and do notalso appear in other relevant, either known or unknown genes or ESTs,which, in general, is impossible to be absolutely certain about.) Foreach perfect-match probe, a “mis-match-probe” is used which differs fromthe perfect-match probe by one single base, typically chosen in themiddle of the oligonucleotide. Thus there should be an expected relation[3][16] between the expression level for a perfect-match complementaryprobe and its corresponding mis-match-probe. If these conditions are metfor most, or many, of the probes for a given gene, it is highly likelythat gene has been expressed.

[0037] A drawback to this approach is that multiple probes are requiredto determine a given gene. This approach is considered “polygamous” inthat there are multiple probes presented on the array for a singletarget detection. When the expected relationship between thehybridization levels detected for a probe and its correspondingmis-match-probe are not met, the data for these probes is regarded asnoise. There are many potential causes of such “noise” in expressiondata. For example, if some genetic contaminant is present thatcomplementary matches exactly the mis-match-probe, the correspondingexpression level for the “probe” (i.e., the perfect-match probe) wouldbe of a “mis-match” level, inverting the signal ratio.

[0038] It would be advantageous to allow for more complex scenarios inwhich all of the discrete probe data is used for expressiondetermination, rather than segregating them into groups and thuslimiting the data from a group of discrete values to single-valuedindications for a particular gene. This approach would also reduce theneed for the probes to be “unique” markers for particular genes.

[0039] It could be possible to interrogate more genes even than thenumber of available probes, although this would not be possible if allgenes were present in a given experiment. This would be done by having aparticular probe to indicate the presence of multiple targets, albeitambiguously. The new approach is “promiscuous” in that a single probecan indicate presence of multiple targets. In the promiscuous approach,all hybridization data becomes valuable signal. Complex hybridizationscenarios (e.g., multiple-base mismatches) can (potentially) be includedproductively in the array hybridization data (signal) analysis.

[0040] Serial Analysis of Gene Expression, or SAGE™, is a techniquedesigned to take advantage of high-throughput sequencing technology toobtain a quantitative profile of cellular gene expression[24][27][25][4]. A more detailed description of SAGE is given in U.S.Pat. No. 5,695,937, the entire contents of which are incorporated hereinby reference. SAGE allows for the simultaneous quantitative analysis ofa large number of mRNA transcripts. The SAGE method has two steps.First, short sequence tags (10-14 bp) are generated from the mRNA. Eachtag should contain sufficient information to identify a uniquetranscript, provided that the tag is derived from a defined locationwithin that transcript. Second, transcript tags are concatenated into asingle molecule and then sequenced, revealing the identity of multipletags simultaneously. The expression pattern of any population oftranscripts can be quantitatively evaluated by determining the abundanceof individual tags and identifying the gene corresponding to such a tag.The data produced by the SAGE method is a list of tags, with theircorresponding count values, and can be portrayed as a digitalrepresentation of cellular gene expression.

SUMMARY OF THE INVENTION

[0041] Accordingly, it is an object of the present invention to providea method and computer program product for determining biologicalexpression levels from complex, “promiscuous” array hybridization datathat also works well in the “polygamous” case.

[0042] Another object of the present invention is to provide a methodand computer program product for determining biological expressionlevels from data derived from SAGE analysis.

[0043] Another object of the present invention is to provide a methodand computer program product for determining biological expressionlevels from uncertain or noisy probe array hybridization data.

[0044] The above and other objects are achieved according to the presentinvention by providing a method for determining a matrix of expressionlevels corresponding to a set of biological targets and a set ofbiological samples, including obtaining a matrix of signal values Pcorresponding to the set of biological targets; computing a vector ofexpression levels for a sample in the set of biological samples usingthe matrix of signal values P; storing the vector of expression levelscomputed in the computing step in a storage matrix; repeating thecomputing and storing steps for each sample in the set of biologicalsamples; and outputting the storage matrix as the matrix of expressionlevels.

[0045] According to this method, the computing step includes obtaining avector of signal values A corresponding to the sample; determining anonnegative vector E, a nonnegative vector s, and a nonnegative vector tthat minimize a total sum of all elements of s and t, and satisfy aconstraint PE+s−t=A; and setting the vector of expression levels to bethe nonnegative vector E determined in the determining step.

[0046] In addition, further steps are provided for computing the vectorof expression levels when the signal data is uncertain or noisy,including obtaining a vector of lower signal values L and a vector ofhigher signal values H corresponding to the sample, each element of Lbeing less than or equal to a respective element of II; determining anonnegative vector E, a nonnegative vector s, and a nonnegative vectort, that minimize a total sum of all elements of s and t, and satisfyconstraints s≧L−PE and t≧PE−H; and setting the vector of expressionlevels to be the nonnegative vector E determined in the determiningstep.

[0047] An important aspect of the present invention is the formulationof the biological expression determination problem as a problem that canbe solved using linear programming techniques.

[0048] Further, methods are provided for expression identification. Inthis problem, a particular target g gives rise to a certain array ofsignal values P(g). Suppose a set U denotes a “universe” of targets. Inany expression array experiment, some set of targets S, which is asubset of U, will be active, and are to be identified.

[0049] To that end, according to an aspect of the present invention,there is provided a method for identifying a set of biological targetsS=S(A) consistent with a vector of nonnegative signal values A, whereinthe array P(g) is compared to the vector A, for each target g in the setof universal targets U.

[0050] According to another aspect of the present invention, there isfurther provided a method for computing a multiplicity vector D(A) thatcorresponds to the vector of nonnegative signal values A. Each elementof D(A) indicates the number of targets with a nonnegative signal valuein the corresponding position in P(g). This method makes use of the setS(A) to compute the multiplicity vector D(A).

[0051] According to another aspect of the present invention, there isfurther provided a method for identifying the subset of ambiguoustargets ΔS(A) that may be expressed in a vector of nonnegative signalvalues A, but cannot be identified with certainty. This method makes useof both the set S(A) and the multiplicity vector D(A).

[0052] According to another aspect of the present invention, there isfurther provided a method for identifying the set of uniquely expressedbiological targets S(A)\ΔS(A) represented by a vector of nonnegativesignal values A. This set consists of those targets in S(A) that are notin the set ΔS(A).

BRIEF DESCRIPTION OF THE DRAWINGS

[0053] A more complete appreciation of the invention and many of theattendant advantages thereof will be readily obtained as the samebecomes better understood by reference to the following detaileddescription when considered in connection with the accompanyingdrawings, wherein:

[0054]FIG. 1 is a flowchart showing the steps of determining a matrix ofbiological expression levels corresponding to a set of biologicaltargets and a set of biological samples according to a first embodimentof the present invention;

[0055]FIG. 2 is a flowchart showing the steps of determining a vector ofexpression levels for a biological sample according to the firstembodiment of the present invention; and

[0056]FIG. 3 is a flowchart showing the steps of determining a vector ofexpression levels for a biological sample according to a secondembodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT

[0057] Biological arrays have as their digital output an array A=(A_(l))of hybridization values. Even though these often appear in a twodimensional array, it is assumed they are numbered by a single index l.A typical size array today has from a few thousand to as many asthree-hundred thousand cells, and this could be expected to grow in thefuture, up to a million cells [9].

[0058] In a simplified model, these values are either zero or one(expressed or non-expressed: this might be achieved by thresholding realhybridization values). If such a determination could be madeunambiguously, it would provide useful answers to biological questions.At the moment, even this is not a simple matter from a practical pointof view. What is the mathematical structure of such a problem? In such asimplified system, particular targets g would cause a certainhybridization array P(g) to occur. P(g) is an array of hybridizationlevels, but again we can think initially of these as being either zeroor one. This simplification of hybridization levels is sufficient todefine the identification problem.

[0059] To begin with, assume that there is a set U of targets thatdenotes a “universe” of targets. The true “universe” may not becompletely known at the moment, e.g., for gene expression arrays, sincenot all genes have been identified. But it is important to have this setas an underlying concept. In any expression-array experiment, some set Sof targets that is a subset of U will be active, and it is this set thatwe want to identify. If, for some reason, it is known that the size of Ucan be restricted in a given experiment, then that can be done. The datafrom the experiment will be a particular hybridization array A of valuesof zero or one for each given probe. The identification problem can thenbe described as follows:

[0060] Determine the set S⊂U of targets defined by

S=S(A)={g∈U:P(g)≦A}  (1)

[0061] where for two arrays A and B we say B≦A if A_(l) is one wheneverB_(l), is one (equivalently B_(l)≦A_(l) for all l).

[0062] For the human genome, the size of U is expected to be aboutone-hundred thousand (10⁵). It is harder to predict the expected size ofa typical S, but an estimate of 10³ to 10⁴ might be reasonable.

[0063] A solution S(A) to (1) can be computed by the followingalgorithm.

[0064] Algorithm 1 Loop over all g∈U and test whether or not P(g)≦A. Ifit is, include g in S(A).

[0065] Each test P(g)≦A requires a number of operations proportional tothe number of non-zeros in P(g). Thus we have the following result.

[0066] Theorem 1 Algorithm 1 finds a solution to equation (1) in anamount of time that is linearly proportional to the size of the universeof targets U, and the constant of proportionality depends only on theaverage number of non-zeroes in any P(g).

[0067] The notion of P(g) can be generalized to apply to a set:P(S)_(l)=1 if P(g)_(l)=1 for some g∈S, and P(S)_(l)=0 if P(g)_(l)=0 forall g∈S. Note that for any array A and any set S⊂U, we have P(S)≦A ifand only if P(g)≦A for all g∈S. In general, if R⊂S then P(R)≦P(S).

[0068] Algorithm 1 always constructs S(A) so that P(S(A))≦A. Onedifficulty that can arise is that P(S(A))≠A. If this occurs, eitherthere was an error in the experiment, or there was a target that was notrepresented in U. However, the following theorem says that the set S(A)constructed in Algorithm 1 is the largest solution to the identificationproblem.

[0069] Theorem 2. Suppose there is a set R⊂U such that P(R)≦A. ThenR⊂S(A).

[0070] Proof: Suppose that g∈R. Since P(R)≦A, we have P(g)≦A. Hence bydefinition g∈S(A).

[0071] Another problem can arise due to a kind of non-uniqueness. It maybe that there are smaller sets R⊂S such that P(R)=P(S). Any targets inthe difference S\R are possibly being expressed, but they cannot beidentified for sure. However, it is possible to quantify the amount ofnon-uniqueness, as follows.

[0072] In constructing S(A), it is possible to keep track of themultiplicity of each array point l, by which we mean the number of timesthis point had a non-zero value in P(g) for some g∈S(A). This defines anarray D(A) of such values, as follows:

D(A)_(l)=cardinality {g∈S(A):P(g)_(l)=1}.  (2)

[0073] Given an array D(A) of degree values for S(A) define D(A)⁻ to bethe array obtained from D(A) by decrementing each positive array value.Thus the non-zero array values in D(A)⁻ correspond to array values inD(A) that were two or more, that is, ones that are multiply expressed.Now define

ΔS(A)={g∈S(A):P(g)≦D(A)⁻}.  (3)

[0074] Lemma 1 For any g∈S(A)\αS(A) there is at least one array indexl_(g) such that P(g)_(l) _(g) =1 and for all other g′∈S(A) (with g≠g′)we have P(g′)_(l) _(g) =0.

[0075] Proof: For a given g∈S(A), let l_(l), . . . , l_(k) enumerate allof the array indices such that P(g)_(l) _(i) =1; k≧1 by the definitionof S(A). In particular, this means that D(A)_(l) _(i) ≧1 for i=1, . . ., k. If the degree of l_(i) is more than one for all i (i.e., D(A)_(l)_(i) ≧2 for i=1, . . . , k), then g∈ΔS(A), If g∈S(A)\ΔS(A), then for oneof the i's, it must be true that D(A)_(l) _(i) =1, and this means thatfor no other g′∈S(A) can P(g′)_(l) _(i) =1.

[0076] Theorem 3 For any array A and for any set R⊂U such thatP(R)=P(S(A)), then S(A)\ΔS(A)⊂R.

[0077] Proof: To begin with, Theorem 2 implies that it can be assumedthat R⊂S(A). By Lemma 1, if g∈S(A)\ΔS(A) then there is an index l_(g)such that P(g)_(l) _(g) =1 and P(g′)_(l) _(g) =0 for all g′∈S(A) withg≠g′, and a fortiori for all g≠g′∈R.

[0078] If g∉R, then P(g′)_(l) _(g) =0 for all g′∈R. This would implythat P(R)_(l) _(g) =0. By Theorem 2, P(S(A)\ΔS(A))≦P(S(A). SinceP(g)_(l) _(g) =1 implies P(S(A)\ΔS(A))_(l) _(g) =1, it must be true thatP(R)_(l) _(g) =P(S(A))_(l) _(g) =1 by assumption. Since this would be acontradiction, it must be true that g∈R.

[0079] Corollary 1 S(A) is the largest possible set of expressed targetsconsistent with a hybridization array A. S(A)\ΔS(A) is the largest setof uniquely expressed targets represented by A. ΔS(A) contains all theambiguous targets which may be expressed, but cannot be identified withcertainty. If ΔS(A) is empty, the solution is unique.

[0080] Having a set ΔS(A) of ambiguous targets does not mean thatexpression levels cannot be determined correctly. It may be possible todistinguish them due to the fact that the numerical value of theexpression levels is different for different targets. An algorithm fordetermining these is considered below.

[0081] The computation of the array D(A) and the set S(A) can beaccomplished simultaneously by the following modification ofAlgorithm 1. Assume the initial construction of a database of indicesl^(g) _(l), . . . , l^(g) _(k) of all indices such that P(g)_(l) _(i)_(^(g)) =1 for all g∈U. Note that this can be done computationally basedon knowledge of the complementary matches between targets and probes anddoes not require experimental determination.

[0082] Algorithm 2 Set all array values of D(A) to zero. Loop over allg∈U. For all i such that P(g)_(l) _(i) =1, test whether or not A_(l)_(i) =1. If it is, include g in S(A) and increment D(A)_(l) _(i) .

[0083] Once the computation of the array D(A) and the set S(A) iscompleted, the set ΔS(A) can be computed by the following algorithm.

[0084] Algorithm 3 Loop over all g∈S(A). For all i such that P(g)_(l)_(i) =1, test whether or not D_(l) _(i) ≧2. If all of them are, includeg in ΔS(A).

[0085] Theorem 4 Algorithm 2 computes the array D(A) defined in (2) inan amount of time that is linearly proportional to the size of theuniverse of targets U. Algorithm 3 computes ΔS(A) in an amount of timethat is linearly proportional to the size of S(A). The constants ofproportionality depend only on the average number of non-zeroes in anyP(g).

[0086] Simple examples now presented will show how the algorithmspresented above work in practice. If there is only one g∈U such thatP(g)_(l) is non-zero for a given array index l, then A_(l)≠0 impliesthat g∈S(A)\ΔS(A). Since g is unique, it can be referred to as g_(i). Ifthis holds for all Q, then ΔS(A)=φ for any hybridization array A, andthe expression determination is always unique. Note that the number ofsuch that g=g_(i) for a fixed g can be greater than one, which is thepolygamous case.

[0087] The next most complicated case would be when each hybridizationindex 1 has at most two targets g such that P(g)_(l) is nonzero. Forsimplicity, also assume that the hybridization array P(g) has exactlytwo non-zero entries per g. In this case, there is a canonical way tonumber the targets and the hybridization indices that simplifies theexpression presentation and analysis, which is described below.

[0088] Select one g and call it g_(l) and correspondingly number one ofthe array locations by “1” so that P(g₁)₁ is nonzero. Let array indexnumber 2 be the other l such that P(g₁)_(l) is nonzero. ThusP(g_(l))_(l) is nonzero precisely for i=1, 2. Suppose there is another gsuch that P(g)₂ is non-zero. Call that target number 2. Thus bydefinition P(g₂)₂ is now nonzero. If there is another l such thatP(g₂)_(l) is non-zero, let this be called the 3-rd array index.Continuing in this way, the result is a sequence of targets such thatP(g_(j))_(i) is non-zero precisely for i=j, j+1.

[0089] Of course the process could terminate in one of two ways. Firstof all, there may be no other g such that P(g)_(i) is non-zero, sotarget i−1 is uniquely identified by the i-th hybridization signalvalue. In this case, the first i−1 targets can be determined from thefirst i hybridization signal values. Then numbering can start over withthe remaining targets and hybridization signal values following the samealgorithm.

[0090] In the second case, the second hybridization index l withnon-zero P(g_(j−1))_(l) may be previously numbered, in which case thereis a cycle back into the current group. By construction, eachhybridization index i>1 already has two targets g (precisely g_(i) andg_(i−l)) with non-zero hybridization signal values. Since the assumptionis that there are at most two targets g such that P(g)_(l) is non-zero,when a non-zero P(g_(i−l))_(l) is found which is previously numbered, itmust be l=1. Again, the algorithm can be started over with a new g toform a new group. By the assumption on the bound of at most two targetsper hybridization signal value, these groups will all be disjoint. Thismeans that the matrix P_(ij)=P(g_(j))_(i) is a lower bi-diagonal matrix,except for an occasional “loop back” entry.

[0091] Note that array values and targets are numbered using the samenumerical values such that P_(ij) is non-zero for just j=i, i+1 in thegeneric case. For simplicity, consider hybridization array data A thatis zero at the beginning and end of each group as defined above. Then itis easy to identify which parts of U are in S(A) and ΔS(A). Thehybridization array A is made up of a number of connected intervals [i,i+k] in which the hybridization values are non-zero, with zero values inbetween. The interval boundaries identify the targets in the setS(A)\ΔS(A), since they can be determined uniquely. However, the targetscorresponding to interval interiors must be in ΔS(A). Thus, wheneverthere is an interval of three or more consecutive non zero hybridizationvalues, there will be non-determinism in the expression pattern. Asdiscussed below, (non-binary) hybridization signal levels candisabmiguate these expression patterns

[0092] Consider the situation in which the individual expression levelsare not just binary values. Suppose that a probe array has a digitaloutput presented as an array A=(A_(l)) of hybridization signal values,which are non-negative numbers. Similarly, particular targets g willcause a certain hybridization array P(g) to occur. P(g) is an array ofhybridization levels indexed by l, each entry P(g)_(l) a non-negativenumber. It would be a reasonable assumption that the non-zero valuesP(g)_(l) might have the same magnitude for all l. However, this appearsnot to be the case in some situations [7][8].

[0093] In particular, it is quite reasonable to assume that P(g) willhave nonzero values corresponding to hybridizations with base-pairmismatches. Hybridization kinetics for oligos is well understood insolution [3][16], but the details of hybridization at surfaces remains aproblem of interest. The approach presented here allows expressiondetermination even in the case when complex hybridization occurs.

[0094] The expression determination problem can then be described asfollows:

[0095] Determine a set {e_(g)≧0:g∈U} of target expression values definedby $\begin{matrix}{{\sum\limits_{g \in U}^{\quad}{e_{g}{P(g)}_{l}}} = {A_{l}{\forall{l.}}}} & (4)\end{matrix}$

[0096] Using vector and matrix notation, this can be simplified. DefineP to be the matrix indexed by array indices l and by g∈U with valueP(g)_(l). That is

P _(l,g) =P(g)_(l)  (5)

[0097] Define E to be a vector with components e_(g) indexed by g∈U.Then the expression determination problem is as follows:

[0098] Find a vector E of gene expression values defined by “solving”

PE=A, E≧0  (6)

[0099] where E≧0 means that e_(g)≧0 for all g∈U.

[0100] Unfortunately, there need be no “solution” to (6) for arbitraryarrays A≧0 due to the positivity constraint, E≧0. The domain space(e.g., the set of possible E's) of the matrix P is the size n of the setU, and the range or image set (e.g., the set of possible A's) is of sizek of the array. In the polygamous approach discussed earlier, k>>n,meaning that the system PE=A is an over-determined system. A necessarycondition for solutions to PE=A to exist is that A be in the range of P,and this corresponds to k−n constraints which must be satisfied.

[0101] There is a simple restriction on the matrix P that will occurlater, but described now for clarity. If a column of P is identicallyzero, say the column indexed by a particular g∈U, then this would meanthat P(g)_(l)=0 for all l, which in turn means that there is no probethat identifies g. Clearly this means that predictions about expressionlevels of g can not be made. In some sense, this would mean that g isnot in the “universe” of targets that can be probed. Thus, a naturalcondition to impose on P (or more precisely, on the “universe” U) isthat no column of P is identically zero.

[0102] In the promiscuous approach discussed earlier, it is reasonableto assume that n>>k. This means that the system PE—A is anunder-determined system. A sufficient condition for solutions to PE=A toexist is that P has full rank. The set of solutions to PE=A is eitherempty or else an affine set of high dimension (dimension n−k in the caseof P full rank, even higher when P is rank deficient). It is possiblethat none of the elements of this set satisfy the constraint E≧0. Thefollowing algorithm does not assume that P has full rank; solutions inthis case exist only for data satisfying constraints similar to those inthe over-determined case.

[0103] A preferred method of the present invention is based on linearprogramming (LP) to determine whether the feasibility problem (6) has asolution, and to find such a solution if one exists. The initial linearprogramming formulation introduces vectors s and t, each containing kartificial variables indexed by l. The formulation is as follows:

min_(E,s,t)Σ_(l) S _(l) +t _(l) subject to  (7)

(E,s,t)≧0, PE+s−t=A.  (8)

[0104] Note several points about this LP. First, given any guess Esatisfying E≧2 0, a feasible starting point can be constructed bysetting

s _(l)=−min((PE−A)_(l), 0), t_(l)=max((PE−A)_(l), 0).  (9)

[0105] Second, the objective value is bounded below by 0, because of theconstraints that all components of s and t remain nonnegative. Thus theLP is feasible and bounded, and so it has a solution. Third, note thatany solution E of the original system (6) can be transformed into asolution of (7-8) by setting s=t=0. Using this fact, a certificate offeasibility for (6) can be obtained by solving (7-8); if the solution of(7-8) results in a strictly positive optimal value, then there exists nosolution to (6). On the other hand, if the solution of (7-8) yields anobjective value of zero, a solution to (6) is obtained by simply takingthe E component of the solution to the LP. Finally, note that noassumptions on P (or k or n) are needed. These facts are collected inthe following result.

[0106] Theorem 5 For any P and A, the minimization problem (7-8) alwayshas a solution and can be started at the feasible point given by (9) forany E. The minimum value in (7) is zero if and only if the E componentof the minimizer is a solution of (6).

[0107] One can show that there exists a solution of (7-8) that has atmost k nonzeros in the solution vector (E, s, t), and that the simplexmethod will find such a solution.

[0108] If in fact (6) is infeasible, the solution of (7-8) yields anonnegative vector E for which the inconsistency in the equations PE=Ais minimized in the 1-norm.

[0109] A reasonable choice for initial E is the array E(A) defined byE(A)_(g)=1 (or a predetermined positive value) if g∈S(A)\ΔS(A) (seeTheorem 7).

[0110]FIG. 1 lists the steps in the preferred method for determining amatrix of expression levels corresponding to a set of biological targetsand a set of biological sample.

[0111] In step 101, a matrix of hybridization values P corresponding tothe set of biological targets is obtained.

[0112] Next, in step 102, a vector of expression levels for a sample inthe set of biological samples using the matrix of hybridization values Pis computed.

[0113] In step 103, the vector of expression levels computed in thecomputing step is stored in a storage matrix.

[0114] In step 104, if expression levels have been computed for all ofthe samples in the set of biological samples, the method proceeds tostep 105. Otherwise, steps 102 and 103 are repeated.

[0115] Finally, in step 105, the storage matrix is outputted as thematrix of expression levels.

[0116]FIG. 2 lists the steps in the preferred method for computing thevector of expression levels for a sample in the set of biologicalsamples using the matrix of hybridization values P.

[0117] In step 201, a vector of hybridization values A corresponding tothe sample is obtained.

[0118] Next, in step 202, a nonnegative vector E, a nonnegative vectors, and a nonnegative vector t that minimize a total sum of all elementsof s and t, and satisfy a constraint PE+s−t=A are determined. A linearprogramming algorithm, such as the simplex method, can be used at thisstep.

[0119] Finally, in step 203, the vector of expression levels is set tobe the nonnegative vector E determined in step 202.

[0120] It is often the case that probes are duplicated on a singlebiological expression array as a way of detecting hybridization errors.In principle, all duplicate probes should have the same hybridizationsignal levels, but in practice they do not. Different protocols can beobserved to deal with the multiple data. One approach would be toaverage the values, and another would be to “vote” on the best value.Either of these, or some other method, could be used to define a singlehybridization signal value for each probe, and the previous algorithmscan be used.

[0121] Consider a more general formulation of the problem of determiningexpression levels, which is applicable when the correct values of someof the components of A_(l) is uncertain. Suppose that instead of aprecise value A_(l), there is an interval [L_(l), H_(l)] that containsthe value. In other words, accept E as a valid vector of expressionlevels if$L_{l} \leq {\sum\limits_{g \in U}^{\quad}{e_{g}{P(g)}_{l}}} \leq {H_{l}.}$

[0122] (If the precise value is known, simply set both L_(l) and H_(l)equal to A_(l).) The formulation analogous to (7-8) is then

min_(E,s,t) Σ _(l) s _(l) +t _(l) subject to  (10)

(E, s, t)≧0, s≧L−PE, t≧PE−H.  (11)

[0123] This formulation would in general be no more difficult to solvethan (7-8).

[0124]FIG. 3 lists the steps in a method for computing the vector ofexpression levels for a sample in the set of biological samples usingthe matrix of hybridization values P, according to a second embodimentof the present invention.

[0125] In step 301, a vector of lower hybridization values L and avector of higher hybridization values H is obtained by testing thebiological sample with an array of reactive probes, each element of Lbeing less than or equal to a respective element of H.

[0126] In step 302, a nonnegative vector E, a nonnegative vector s, anda nonnegative vector t, that minimize a total sum of all elements of sand t, and satisfy constraints s≧L−PE and t≧PE−H are determined. Alinear programming algorithm, such as the simplex method, can be used atthis step.

[0127] Finally, in step 303, the vector of expression levels is set tobe the nonnegative vector E determined in step 302.

[0128] Returning to the formulation (7), consider the case in which theset (6) is nonempty. As mentioned above, the E component of any solutionof (7-8) that has s=t=0 is a member of this set. In general, the set ofvectors E satisfying (6) will be a (possibly unbounded) polytope. Itmay, however, be of interest to obtain some particular solution from theoptimal set, or to learn something about the properties of this set.

[0129] To begin with, observe that in this case, the set of vectors Esatisfying (6) will be a bounded polytope.

[0130] Theorem 6 Suppose that the matrix P has non-negative entries andthat none of its columns are identically zero. Then the set (6) isbounded.

[0131] Proof. The set (6) is unbounded if and only if there is a vectorF=[f_(g)]_(g∈U) such that

F≧0, PF=0, F≠0.  (12)

[0132] Suppose some component f_(g) of F is strictly positive. By theassumptions, there is an index l such that P(g)_(l) is strictlypositive. The l-th component of PF is at least P(g)_(l)f_(g), which isstrictly positive, contradicting PF=0. Hence no such component f_(g)exists, and F=0. Thus, no solution to (12) exists, and hence the set (6)is bounded.

[0133] It may be useful to find the solution of minimum-norm in casethere exist multiple solutions. One can minimize the 1-norm by solvingthe following LP: $\begin{matrix}{{{\min {\sum\limits_{g \in U}^{\quad}{e_{g}\quad {subject}\quad {to}\quad E}}} \geq 0},{{PE} = A},} & (13)\end{matrix}$

[0134] using the E component of the solution obtained from (7-8) as afeasible starting point for (13). Using a Euclidean norm criterion, onecan solve $\begin{matrix}{{{\min \frac{1}{2}{\sum\limits_{g \in U}^{\quad}{e_{g}\quad {subject}\quad {to}\quad E}}} \geq 0},{{PE} = {A.}}} & (14)\end{matrix}$

[0135] This is a convex quadratic program [1] rather than a linearprogram, but it can still be solved with interior-point software (seeChapter 8, “Primal-dual interior-point methods” in NumericalOptimization by Nocedal and Wright [15]). Variants of (13) and (14) thatuse weighted norms or the ∝-norm can be formulated easily.

[0136] One could also seek extreme points of the set (6) by solvinglinear programs with random linear objectives. For instance the problem$\begin{matrix}{{{\min {\sum\limits_{g \in U}^{\quad}{x_{g}e_{g}\quad {subject}\quad {to}\quad E}}} \geq 0},{{PE} = A},} & (15)\end{matrix}$

[0137] where each x_(g) is a random variable (drawn from somedistribution that allows both positive and negative values) will yield avertex of the optimal polytope. By solving a sequence of problems of theform (15), for different choices of x, one can construct a subset of theoptimal polytope by taking the convex hull of the solutions obtained inthis process. By inspecting this subset, one may for instance be able torestrict the ambiguity produced by the data to certain subspaces. Notingthat certain components e_(g) of the solution are identical (or nearlyidentical) regardless of the choice of x, one might conclude that thesecomponents are well determined by the data.

[0138] There is an intriguing connection between the discrete model (theidentification problem) and the continuous (expression-leveldetermination) problem. The sets S and ΔS allow us to determine theactive set for the minimization problem of equations (7-8). But firstconsider modified definitions suitable for the more general expressionvalues being considered here.

[0139] Generalizing the definition in equation (1), define

S=S(A)={g∈U:P(g)_(l)>0→A _(l)>0∀l }.  (16)

[0140] The solution S(A) to (16) can still be computed by a simplemodification to Algorithm 1. Similarly, equation (2) can be replaced bythe more general definition

D(A)_(l)=cardinality {g∈S(A):P(g)_(l)>0}  (17)

[0141] while keeping the definition (3) of ΔS(A). These can be computedby simple modifications of Algorithms 2 and 3. With these newdefinitions, one obtains the following modification of Lemma 1.

[0142] Lemma 2 For any g∈S(A)\ΔS(A) there is at least one array indexI^(g) such that P(g)_(l) _(g) >0 and for all other g′∈S(A) (with g≠g′),P(g′)_(l) _(g) >0.

[0143] Using this Lemma, one can now prove the following result whichcharacterizes the active set for a minimization problem for solving (6).Note that the solution (E, s, t) to (7-8) satisfies PE=Â=A−s+t.

[0144] Theorem 7 Suppose there exists an expression level vector E suchthat

PE=A, E≧0.  (18)

[0145] Then

S(A)\ΔS(A)⊂{g∈U\e _(g)>0}⊂S(A).  (19)

[0146] Proof. Consider first the right inclusion in (19). Suppose forcontradiction that e_(f)>0 for some f∈U\S(A). Then by definition of S(A)there exists an index k such that

P(f)_(k)>0 and A _(k)=0.

[0147] From (18), it follows that${0 = {A_{k} = {{\sum\limits_{h \in U}^{\quad}{{P(h)}_{k}e_{h}}} \geq {{P(f)}_{k}e_{f}} > 0}}},$

[0148] giving the desired contradiction.

[0149] To prove the left inclusion in (19), take an arbitraryg∈S(A)\ΔS(A) and try to show that e_(g)>0. By Lemma 2, there is an indexl such that

P(g)_(l)>0, while P(h)_(l)=0 for all h∈S(A)\{g}.  (20)

[0150] In particular, since g∈S(A), it follows from (16) that

A _(l)>0.  (21)

[0151] In addition, it is true that e_(h)=0 for all h∈U\S(A), by theargument above. By (18) and (20), it follows that for this index l$\begin{matrix}{{0 < A_{l}} = {\sum\limits_{h \in U}^{\quad}{P(h)}_{{le}_{h}}}} \\{= {{P(g)}_{{le}_{g}} + {\sum\limits_{h \in {{S{(A)}}\backslash {\{ g\}}}}^{\quad}{P(h)}_{{le}_{h}}} + {\sum\limits_{h \in {U\backslash {S{(A)}}}}^{\quad}{P(h)}_{{le}_{h}}}}} \\{= {P(g)}_{{le}_{g}}}\end{matrix}$

[0152] Therefore e_(g)>0 as required, proving the result.

[0153] This result says that, if there is a solution to (6), then anyalgorithm for finding it may be restricted by assuming that e_(g)=0 forg∉S(A). That is, compute S(A) first, and then start looking for asolution E “supported” in S(A). Moreover, one can be sure that all e_(g)are positive for g∈S(A)\ΔS(A). For example, restrict the minimizationproblem in equations (7-8) to E∈S(A) only.

[0154] One question of interest is what the linear programming modeldoes when there is a hybridization array that relates to a targetexpression pattern for a target not in the universe U. Suppose thatthere is some g∉U with hybridization pattern B=P(g). If there is anhybridization pattern A⁰=PE^(o) perturbed by adding B to get anhybridization pattern A=A^(o)+B, then the linear programming model (7-8)will produce an answer (E, s, t) with

PE+s−t=A=A ⁰+B=PE^(o)+B.  (22)

[0155] Thus the error E−E^(o) satisfies the error equation

P(E−E ⁰)=B−s+t.  (23)

[0156] Similarly, it is useful to know what the linear programming modelwill produce if there is a perturbation, e.g., due to noise. But theformulation reduces to the same considerations in which B is interpretedas a perturbation due to noise.

[0157] The algorithm of equations (7-8) is a minimization problem, andthe vector (E⁰, s^(o), t^(o)) is a feasible approximant, wheres^(o)−t^(o)=B, that is

s ^(o) _(l)=max (B _(l), 0), t ^(o) _(l=−min () B _(l), 0).  (24)

[0158] Note that

PE ^(o) +s ^(o) −t ^(o) =A ^(o) +s ^(o) −t ^(o) =A ^(o) +B=A.  (25)

[0159] By the optimality condition of (7-8)

∥(s,t)∥_(l) _(¹) ≦∥(s ^(o) , t ^(o))∥_(l) _(¹) =∥B∥ _(l) _(¹) .  (26)

[0160] Thus the role of s and t can be interpreted as making thehybridization A=A^(o)+B−s+t satisfy the implicit constraint needed sothat the system (6) has a solution E≧0 with right hand side A. One boundon the size of s and t is that s and t need be no bigger than requiredsimply to cancel B completely, but s and t can be smaller as well. Thuss and t provide a lower bound for the size of the (unknown) perturbationB.

[0161] In the case that the matrix P is invertible, this means thatwhatever the perturbation B there will be a unique expression Eattributed to it. The only indicator of error is the size of s and t.

[0162] The lower bound (26) means that the size of ∥(s, t)∥_(l) _(¹) isa conservative estimation of the size of the hybridization error. Thusit may underestimate the error, but it will never overestimate it.Otherwise said, if it is large, then there is definitely a largediscrepancy in the data, and it should not be trusted.

[0163] Recall that the error E−E^(o) satisfies the error equation (23),namely, P(E−E^(o))=B−s+t. But then (26) implies that

∥B−s+t∥_(l) _(¹) ≦∥(s, t)∥_(l) _(¹) +∥B∥ _(l) _(^(1≦2)) ∥B∥ _(l) _(¹).  (27)

[0164] In the case that P is invertible, this allows us to give a boundon the expression error E−E^(o).

[0165] Previously, two examples of possible expression array matriceswere presented. In the first example there is only one g∈U such thatP(g)_(l) is non-zero for a given array index l, which is referred to asg_(l). One can number the targets and probes in such a way that theexpression matrix P_(ij)=P(j)_(i) has a simple form. Let the firsttarget index j=1 correspond to some g∈U for which P(g)_(l) is non-zerofor some l. There is some number I₁≧1 of array indices l such thatP(g)_(l) is non-zero. Number these array indices 1, . . . , I₁. Now pickanother g∈U for which P(g)_(l) is non-zero for I₂≧1 array indices l, andcall this target j−2 and number these array indices I₁+1, . . . , I₁+I₂.Continuing in this fashion, one constructs a matrix P with only onenon-zero per row, such that the j-th column consists of I₁+I₂+ . . .I_(j−1) zeros, followed by I_(j) non-zeros, and then followed by allzeros. For simplicity, assume that all the non-zeros are ones.

[0166] As shown above, expression determination is always unique in thiscase. However, it is interesting to consider how the algorithm (7-8)deals with this case, especially in the presence of noisy data. The roleof the extra variables s and t in this case is to make sure that thehybridization array A−s+t is in the range of the matrix P. The range ofP is easy to describe. It consists of vectors A such that the entriesI_(j)+1, . . . , I_(j+1) all have the same value, for each j (forcompleteness, define I_(o)=0). For any A not satisfying this constraint,algorithm (7-8) will adjust the variables s and t to make the vectorA−s+t satisfy this property.

[0167] It suffices to see what happens with a single block j, soconsider the case when P is a k=I₁ by 1 matrix (one expression valueonly, with k hybridization array values). For simplicity, assume thatthe first k−1 hybridization values of A are α<0 and the k-th value isβ≦α. This would correspond to a simple error in one of the hybridizationarray values. Then it is easy to see that the optimal vectors s and twill have the following form. The first k−1 values of s will be somevalue σ and the k-th value will be zero. The first k−1 values of t willbe zero, and the k-th value will be some value τ. To have A−s+t be inthe range of P, it must be true that α−σ=β+τ. Thus

Σ_(i)(s _(i) +t _(i))=σ(k−1)+τ=σ(k−2)+α−β.  (28)

[0168] If k>2, this is minimized by taking σ=0 (and so τ=α−β) whichcorresponds to the expression value obtained by the “voting” algorithm:the consensus k−1>1 values a are confirmed. In the case k=2 (when thereis a tie), the resulting solution adjusts the array so that A−s+tcorresponds to the average (α+β)/2. Thus the optimization algorithm(7-8) does a very reasonable thing in this case.

[0169] When the data is more complex, the effect of the optimizationalgorithm (7-8) is more complicated to describe. The resulting“consensus value” will be one of the array values, the one thatminimizes the L_(l) norm of the deviation from the other values. FIG. 1shows how this would work with some synthetically generated random data.As shown in FIG. 1, the algorithm does a good job of “healing” errorsintroduced by noise.

[0170] In the second example above the case in which there are at mosttwo targets g such that P(g)_(l) is nonzero for any l, and for whichthere are exactly two non-zero hybridization levels per target wasconsidered. In this case, the expression matrix P_(i),_(j)=P(j)_(i) is alower bi-diagonal matrix, except for an occasional “loop back” entry.Each of these “loop backs” marks a block of independent expression, sofocus on just one. For simplicity, just assume that P=(p_(ij)) is of theform p₁₁=1, P_(i),_(i)−1=P_(i),_(l)=1 for all i=2, . . . , n and therest are zero. Expression analysis in this case is equivalent to solvingthe equation PE=A, and it is possible to invert the matrix P explicitly.Let Q=(q_(ij)) be the lower-triangular matrix whose i-th row satisfiesq_(i−j)=(−1)^(i−j). Then Q is the inverse of P. Thus expression levelscan be determined from E=QA, provided these values are non-negative.

[0171] One thing this example makes clear is that the ambiguityresulting from ΔS(A)≠φ above, which occurs for any array data with threeor more consecutive non-zero values, does not lead to an inability todetermine expression levels. Any hybridization array A will yield anunambiguous expression E=QA. Thus utilizing (non-binary) expressionlevels leads to a more robust identification system.

[0172] Having an explicit inverse for P allows one to study the effectof errors in the data. Suppose an array A is perturbed by ∈=(∈_(i)). Theresulting expression values are given by Ê=Q(A=∈)=E+Q∈, provided thesevalues are positive. Suppose that ∈_(i)=δ(−1)^(i−j). Then(Q∈)_(i)=δ_(i). This means that the error can be as large as the numberof expression values times the perturbation. Otherwise said, the inversematrix Q can amplify the error significantly.

[0173] One problem with the current approach is the limitation to twonon-zero hybridization array locations. It is certainly possible to pickoligos which can target an essentially arbitrary number of genes [4].There is another simple expression matrix one can examine. Suppose thatthe i-th probe identifies the first i targets. This corresponds toassuming that p_(ij)=1 for all j=1, . . . , i for i=1, . . . , n. Inthis case the inverse matrix Q has the simple form q₁₁=1, q_(i,i)=−1,and q_(i,i)=1 for all i=2, . . . , n.

[0174] In this example, it is possible to characterize the arrays A forwhich the corresponding expression levels E are non-negative. Thecondition E=QA≧0 can be used inductively as follows. First of all,0≦e_(l)=α_(l), but this only says that the hybridization array value αlshould not be negative. But for i>1, 0≦e_(i)=α_(i−α) _(i−1), and thissays that the array values should not decrease in this ordering of thevalues. This is reasonable since each succeeding value represents moreand more targets.

[0175] Previously, an analysis of what the linear programming model doeswhen there is an array A=PE^(o)+B for some perturbation B wasconsidered. The algorithm (7-8) produces a solution (E, s, t) where

PE=A−s+t.  (29)

[0176] Then the role of s and t can be interpreted as making the vectorA−s+t satisfy the constraint needed so that E≧0. In the example, thismeans that A−s+t must be non-decreasing; recall that A−B=PE^(o) must benon-decreasing to begin with, but there is no reason that A would be.

[0177] Consider now the situation when targets in the same experimenthave been labeled in different ways, e.g., so the targets appear on thearray with different colors. For example, targets could be fromdifferent samples: a “normal” sample might be labeled with a greenfluorescence label and another sample might be labeled with a red label.If both types of samples are present in equal amounts, the resultingcolor will appear yellow.

[0178] From a mathematical point of view, the expression levels aresimply vectors, with one component for each color. Let us assume thatthere is some number c≧1 of colors. In the more general, multi-colorcase, one supposes that the probe array has a digital output presentedas an array A=(A_(l)) of hybridization vectors, where each componentA_(l) ^(i) is a non-negative number, for i=1, . . . c.

[0179] Particular targets g will still generate a hybridization arrayP(g) of scalar hybridization values for each l, again with non-negativecomponents. The color would be determined by the marking, but theresponse would presumably be the same independent of the marking color,or at least that is the assumption. Let an expression vector e_(g)≧0 ife_(g) ^(i≧)0 for all i=1, . . . , c.

[0180] The multi-color expression determination problem is then asfollows:

[0181] Determine a set {e_(g)≧0:g∈U} of gene expression vectors definedby “solving” $\begin{matrix}{{\sum\limits_{g \in U}^{\quad}{e_{g}{P(g)}_{l}}} = {A_{l}\quad {\forall{l.}}}} & (30)\end{matrix}$

[0182] Using vector and matrix notation, this can be simplified asbefore. Define P to be the matrix indexed by array indices l with valueP(g)_(l)and by g∈U. Define E to be a vector-valued array indexed by g∈U,and define E≧0 to mean that e_(g)≧0 for all g∈U. Then the multi-colorexpression determination problem is as follows:

[0183] Determine an array E≧0 of gene expression vectors defined by“solving”

PE=A.  (31)

[0184] This is really just c independent problems, of the form

PE ^(i)=A_(i) , E _(i)≧0.  (32)

[0185] for i=1, . . . c. Thus the techniques and results discussed aboveapply. If the original data for the problem is an array A of colorvalues, these will have to be decomposed into constituent colors A^(i),i=1, . . . , c. But after that, the problem can be solved by techniquesdeveloped for the single color case.

[0186] Algorithms have been presented based on linear programming thatdetermine expression values from arrays of biological experiments withcomplex relationships between probes and targets. The algorithms havebeen analyzed abstractly and bounds have been given to relate certaincomputed quantities to hybridization error. They have been shown to workfor both “promiscuous” array data, in which there may be multipletargets indicated by a single probe, as well as in the “polygamous” casewhere there are multiple probes for a single target.

[0187] In an alternative embodiment, the method for determiningbiological expression levels can be used with Serial Analysis of GeneExpression (SAGE™) data [4][24][25][27]. In standard SAGE analysis,probes are represented by SAGE tags, n-mer sequences following a givenmarker sequence, such as CATG. The targets are the genes that containthem. Of course, there can be many such genes [24][4]. Using thenotation described above, the corresponding matrix entry P(g)_(l) isnon-zero if the l-th SAGE tag is in the gene (or EST) g, and zerootherwise. For simplicity, one can take the non-zero entries to be equalto one. Multiple matches correspond to a row of P with multiplenon-zeros. However, in any given column, there will be at most onenon-zero entry. Thus, there is no way to distinguish differentexpression levels with standard SAGE analysis using the method of thepresent invention described above.

[0188] However, if multiple SAGE analyses are performed with multiplemarkers (e.g., CTAG, etc.), one can get multiple non-zero entries incolumns of P. This provides a set of equations that are amenable to themethod of the present invention. Note that there is no concept ofmis-hybridization with SAGE (the error rates in sequencing are verysmall). However, it might be that different markers (CTAG versus CATG)would have different affinity levels, leading to different non-zerocoefficients in P. Note that in multi-SAGE analysis, the index 1 wouldbe different for different markers. That is, CATGAACCGGTTAA is differentfrom CTAGAACCGGTTAA. There are potentially sixteen different 4-merpalindromes available to use as markers in multi-SAGE analysis. Thiswould lead to having up to sixteen non-zero elements in each column ofthe matrix P.

[0189] It will be appreciated from the foregoing that the presentinvention represents a significant advance over other systems andmethods for determining biological expression levels. It will also beappreciated that, although a limited number of embodiments of theinvention have been described in detail for purposes of illustration,various modifications may be made without departing form the spirit andscope of the invention. Accordingly, the invention should not be limitedexcept as by the appended claims.

1. A method for determining a matrix of expression levels correspondingto a set of biological targets and a set of biological samples,comprising: obtaining a matrix of signal values P corresponding to theset of biological targets; computing a vector of expression levels for asample in the set of biological samples using the matrix of signalvalues P; storing the vector of expression levels computed in thecomputing step in a storage matrix; repeating the computing and storingsteps for each sample in the set of biological samples; and outputtingthe storage matrix as the matrix of expression levels.
 2. The method ofclaim 1, wherein the computing step comprises: obtaining a vector ofsignal values A corresponding to the sample; determining a nonnegativevector E, a nonnegative vector s, and a nonnegative vector t thatminimize a total sum of all elements of s and t, and satisfy aconstraint PE+s−t=A; and setting the vector of expression levels to bethe nonnegative vector E determined in the determining step.
 3. Themethod of claim 2, wherein obtaining the vector of signal values Acomprises: testing the sample with an array of reactive probes.
 4. Themethod of claim 2, wherein obtaining the vector of signal values Acomprises: testing the sample with an array of reactive probes, at leastone probe in the array of reactive probes being indicative of more thanone target in the set of biological targets.
 5. The method of claim 2,wherein obtaining the vector of signal values A comprises: testing thesample with an array of reactive probes, each probe in the array ofreactive probes being a sequence of oligonucleotides.
 6. The method ofclaim 2, wherein obtaining the vector of signal values A comprises:testing the sample with an array of reactive probes including at leastone probe which is a sequence of oligonucleotides from at least one partof a known gene.
 7. The method of claim 2, wherein the determining stepcomprises: identifying a set of biological targets S(A) consistent withthe vector of nonnegative signal values A; and setting, for each targetin the set of biological targets that is not in S(A), a respectiveelement of the nonnegative vector E to zero.
 8. The method of claim 2,wherein the determining step comprises: using a linear programmingalgorithm to determine the nonnegative vector E, the nonnegative vectors, and the nonnegative vector t.
 9. The method of claim 8, wherein thedetermining step further comprises: constructing a feasible startingpoint for the linear programming algorithm, for an initial nonnegativevector E_(o), by initializing an l-th element of the vector s to bes_(l)=−min((PE_(o)−A)_(l), 0), and initializing an l-th element of thevector t to be t_(l)=−max((PE_(o)−A),_(l),0).
 10. The method of claim 9,wherein the determining step further comprises: constructing the initialnonnegative vector E_(o), by setting each element of E_(o) to be eitherzero or, if a corresponding biological target could have uniquelycontributed to the vector of signal values A, to a predeterminedpositive value.
 11. The method of claim 2, wherein obtaining the vectorof signal values A comprises: generating a plurality of short nucleotidesequence tags from the sample; concatenating the plurality of shortnucleotide sequence tags into a single molecule; sequencing the moleculeto determine a count for each tag in the plurality of short nucleotidesequence tags; and mapping the counts determined in the sequencing stepinto the vector of signal values A.
 12. The method of claim 1, whereinthe matrix of signal values P obtained in the obtaining step includesmore columns than rows.
 13. The method of claim 1, wherein obtaining thematrix of signal values P comprises: testing each target in the set ofbiological targets with an array of reactive probes.
 14. The method ofclaim 1, wherein obtaining the matrix of signal values P comprises:testing each target in the set of biological targets with the array ofreactive probes, each target being a gene or a gene fragment.
 15. Themethod of claim 1, wherein the computing step comprises: obtaining avector of lower signal values L and a vector of higher signal values Hcorresponding to the sample, each element of L being less than or equalto a respective element of H; determining a nonnegative vector E, anonnegative vector s, and a nonnegative vector t, that minimize a totalsum of all elements of s and t, and satisfy constraints s≧L−PE andt>PE−H; and setting the vector of expression levels to be thenonnegative vector E determined in the determining step.
 16. The methodof claim 1, wherein obtaining the matrix of signal values P comprises:generating a plurality of short nucleotide sequence tags for a target inthe set of biological targets; concatenating the plurality of shortnucleotide sequence tags into a single molecule; sequencing the moleculeto determine a count for each tag in the plurality of short nucleotidesequence tags; mapping the counts determined in the sequencing step intothe matrix of signal values P; and repeating the generating,concatenating, sequencing, and mapping steps for each target in the setof biological targets.
 17. A method for determining a vector ofexpression levels, comprising: obtaining a vector of signal values Acorresponding to a biological sample; obtaining a matrix of signalvalues P corresponding to a set of biological targets; determining all Nnonnegative vectors satisfying an equation PE=A; outputting, if N=1, thenonnegative vector determined in the determining step as the vector ofexpression levels; computing, if N=0, a nonnegative vector E* thatminimizes an L₁ norm of an inconsistency in the equation PE=A, andoutputting the nonnegative vector E* as the vector of expression levels;and selecting and outputting, if N>1, one of the N nonnegative vectorsdetermined in the determining step as the vector of expression levels.18. The method of claim 17, wherein the selecting and outputting stepcomprises: choosing one of the N nonnegative vectors determined in thedetermining step that minimizes a vector norm, the vector norm being atotal sum of all elements.
 19. The method of claim 17, wherein theselecting and outputting step comprises: choosing one of the Nnonnegative vectors determined in the determining step that minimizes avector norm, the vector norm being a total sum of the squares of allelements.
 20. A method for identifying a set of biological targets S(A)consistent with a vector of nonnegative signal values A, comprising:obtaining a vector of nonnegative signal values P(g) for a target g in aknown universe of biological targets U; determining if, for everypositive element of P(g), whether a respective element of A is alsopositive, and if so, including the target g in the set of biologicaltargets S(A); repeating the obtaining and determining steps for eachtarget g in the known universe of biological targets U; and outputtingthe set of biological targets S(A).
 21. A method for identifying a setof uniquely expressed biological targets consistent with a vector ofnonnegative signal values A, comprising: identifying a set of targetsS(A) consistent with the vector of nonnegative signal values A;identifying a set of ambiguous biological targets DS(A) that may beexpressed in the vector of nonnegative signal values A, but cannot beidentified with certainty; and outputting those targets that are inS(A), but not in DS(A), as the set of uniquely expressed biologicaltargets consistent with the vector of nonnegative signal values A. 22.The method of claim 21, wherein identifying the set of ambiguousbiological targets DS(A) comprises: computing a multiplicity vector D(A)corresponding to A; obtaining a vector of nonnegative signal values P(g)for a target g in S(A); including the target g in DS(A) if, for eachpositive element of P(g), a respective element of D(A) is at least two;and repeating the obtaining and including steps for every target g inS(A).
 23. The method of claim 22, wherein computing the multiplicityvector D(A) comprises: initializing D(A) to all zeros; obtaining avector of nonnegative signal values P(g_(o)) for a target go in S(A);incrementing by one those elements of D(A) that correspond to positiveelements of P(g_(o)); and repeating the obtaining a vector ofnonnegative signal values P(g_(o)) step and the incrementing step foreach target g_(o) in S(A).
 24. A computer program product configured tostore plural computer program instructions which, when executed by acomputer, causes the computer to determine a matrix of expression levelscorresponding to a set of biological targets and a set of biologicalsamples, by performing plural steps comprising: obtaining a matrix ofsignal values P corresponding to the set of biological targets;computing a vector of expression levels for a sample in the set ofbiological samples using the matrix of signal values P; storing thevector of expression levels computed in the computing step in a storagematrix; repeating the computing and storing steps for each sample in theset of biological samples; and outputting the storage matrix as thematrix of expression levels.
 25. The computer program product as claimedin claim 22, wherein the computing step comprises: obtaining a vector ofsignal values A corresponding to the sample; determining a nonnegativevector E, a nonnegative vector s, and a nonnegative vector t thatminimize a total sum of all elements of s and t, and satisfy aconstraint PE+s−t=A; and setting the vector of expression levels to bethe nonnegative vector E determined by the preceding determining step.26. A system configured to determine a matrix of expression levelscorresponding to a set of biological targets and a set of biologicalsamples by performing the steps recited in any one of claims 1-16.
 27. Asystem configured to determine a vector of expression levels byperforming the steps recited in any one of claims 17-19.